<html>
<title> Performance Hints for Signal-Processing Applications </title>
<body>
<h2> Performance Hints for Signal-Processing Applications </h2>

<p>
MIPS processors, particularly the newer ones, can execute
a substantial amount of real-time signal-processing code. But as with
all processors, writing efficient code requires an understanding
of the processor and compiler architectures. This list is intended to
give you some things to think about.

<p>
First, my <b> disclaimer. </b>
I've based this list upon a number of "gotchas" that I and
other audio & image-processing developers have hit. This is by
no means a complete list, but it hopefully has some useful hints. It's
a new list, and I'm constantly refining it and adding things. If you
have your own "gotchas" or suggestions, please send them in!
<p>
Thanks to Bruce Karsh, Gints Klimanis, Chris Pirazzi, and Bryan James
for their review and contributions.
<p>
<a href="http://reality.sgi.com/employees/cook/"> Doug Cook </a><br>
<a href="mailto:cook@sgi.com"> (cook@sgi.com) </a>
<hr>
1. <b> There are few "magic bullets", only guidelines and trade-offs. </b> What
speeds up one app may slow down another if the applications are architected
differently. Take the time to understand and profile your application to
confirm whether these guidelines actually help.

<hr>
2. <b> Consider the data type you need for each operation. </b> Don't make the
assumption that integer types are faster than floating-point. This is
often not true. Here are some gross generalizations about what's best
for different types of operations:
<pre>
Bit-Swizzling & Logical Ops     Integer only
Addition                        Integer faster than FP
General Multiplication          FP faster than integer
Multiplication by a Constant    Integer is often faster if you can
                                write it as a small number of shifts
                                & adds. Otherwise FP is faster.
Division                        FP is faster. But if you can replace
                                division by multiplication, do so;
                                multiplication is faster.
Conditionals                    Integer is faster
</pre>
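<p>
To make the shifts-and-adds entry concrete, here's a small sketch in C
(the name "mul10" is just for illustration): x*10 decomposes into
x*8 + x*2, which is two shifts and an add instead of a general multiply.

```c
/* x*10 rewritten as (x<<3) + (x<<1), i.e. x*8 + x*2: two shifts
   and one add instead of a general integer multiply. */
static int mul10(int x)
{
    return (x << 3) + (x << 1);
}
```

Whether this beats the hardware multiplier depends on the processor, so
disassemble and measure before committing to it.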

Another thing to consider is that the FP unit has its own register set.
When you use FP, you effectively get a much larger register set to use, because
your logical code uses the integer registers, whereas the computations use
the FP registers. With more registers, some loops may require fewer
load and store operations.
<p>

In general, for the typical signal-processing application, which
requires a lot of multiply/adds of variables, floating-point offers superior
performance. But as always, it depends upon the architecture of the
application; your mileage may vary.
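<p>
The division entry above is worth a sketch too. If every sample in a
loop is divided by the same value, you can do one divide outside the
loop and multiply inside it (function names here are illustrative):

```c
/* One divide hoisted out of the loop; each sample then needs only
   a multiply, which is cheaper than a divide. */
static void scale_buffer(float *buf, int n, float divisor)
{
    int i;
    float recip = 1.0f / divisor;   /* the single divide, done once */
    for (i = 0; i < n; i++)
        buf[i] *= recip;
}

static float scale_demo(void)
{
    float buf[4] = { 2.0f, 4.0f, 6.0f, 8.0f }, sum = 0.0f;
    int i;
    scale_buffer(buf, 4, 2.0f);     /* halve every sample */
    for (i = 0; i < 4; i++)
        sum += buf[i];
    return sum;                     /* 1 + 2 + 3 + 4 */
}
```

Note that the reciprocal trick can change the last bit of the result;
for audio processing that's usually fine, but check if exact rounding matters.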

<p>
2a. Another note: one common speed-up for integer operations is to use
table lookup where possible. This often works well for image processing,
where the input is 8- or 10-bit data and the tables are small enough to
stay in the cache. For audio operations, table lookup is less common, since
audio data is typically 16-24 bits.
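<p>
As a sketch of the table-lookup idea for 8-bit image data: the table below
approximates a gamma-2.0 transfer curve with integer math only (the curve
and names are illustrative, not from any particular library). The 256-byte
table easily stays in the primary cache, and the per-pixel work becomes a
single indexed load.

```c
#include <stddef.h>

/* 256-entry lookup table for a rough gamma-2.0 curve, built with
   integer math only.  256 bytes fits easily in the primary cache. */
static unsigned char gamma2_lut[256];

static void build_gamma2_lut(void)
{
    int i;
    for (i = 0; i < 256; i++)
        gamma2_lut[i] = (unsigned char)((i * i + 127) / 255);
}

/* The per-pixel computation becomes one indexed load. */
static void apply_lut(unsigned char *pix, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        pix[i] = gamma2_lut[pix[i]];
}

static int lut_demo(void)
{
    unsigned char pix[3] = { 0, 128, 255 };
    build_gamma2_lut();
    apply_lut(pix, 3);
    return pix[0] + pix[2];   /* the endpoints map to 0 and 255 */
}
```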

<hr>
3. <b> Avoid unnecessary type conversion. </b> This is a trade-off with #2.
Conversion between integer and floating-point can be expensive
(particularly FP->int, which sometimes requires limiting). If your
application accepts and produces integer data and only does a small
amount of signal processing, the cost of conversion to FP may outweigh
the gains in performance from using FP.

<p>
Also, be aware: in some cases, the compiler may do type conversion
"behind your back," particularly by promoting single-precision to
double-precision floating-point (hint #7).

<hr>
4. <b> Consider the instruction set (ISA) you want to use. </b>
If you're writing in C, selecting an instruction set requires just a
compiler flag, e.g. -mips2. There are currently 4 MIPS instruction
sets (MIPS 1-4). Each set is a superset of the previous. Roughly
speaking, here's how they work:
<pre>
MIPS 1    R3000
MIPS 2    R4000 instructions
MIPS 3    R4000 -- introduces 64-bit operations
MIPS 4    R8000/R10000 instructions,
          notably floating-point multiply/add, conditional FP move
</pre>

<p>
I've noticed that one big benefit of going from MIPS1 to MIPS2 is a
reduction in the cost of FP->integer conversion. For MIPS1, the
compiler will generate some inline code to perform this; for MIPS2, that
code is replaced by a single instruction. In general, though, you
should not expect a huge performance gain going mips1->mips2.

<p>
The MIPS4 instruction set provides some nice opportunities for
signal-processing applications. It provides an FP multiply/add
instruction and floating-point conditional-move instructions. The
latter let you write branch-free limiting code, which is a big
win in some cases (see #5).

<p>
With each instruction set, you may get a performance gain at the cost
of compatibility with older MIPS-architecture processors. In other
words, if you compile -mips2, you may see some speedup, but you lose
R3000 compatibility. This is a trade-off you will have to consider.
Some people maintain multiple versions of a binary using different
ISA's. (If your application uses "inst" for installation, you can have
"inst" automatically select the appropriate binary for the end-user's
machine.)

<hr>
4a. <b> Disassemble your code and look at it. </b> Use the "dis" command on your
actual binary, not the code generated by "cc -S" (the
compiler lies, often substituting higher-level macros for more complex
things that actually occur in the binary itself).

<p>
There's really no substitute for this. Looking at the source code
and thinking about the algorithms at a high level is a first step, but
in the end, if you need to squeeze every bit of performance out of
your code, you need to examine the actual instructions that the CPU
will execute.

<hr>
5. <b> Avoid branches where possible in critical code. </b> If you can replace a
bit of frequently executed code with an equivalent branch-free piece of
code, it may run in fewer cycles even though it has more instructions. This is
because the MIPS processors are pipelined, and for some of them, a branch
requires several extra cycles to refill the pipeline.

<p>
Another reason to avoid branches is that many of the compiler's optimizations
are limited to basic blocks, i.e. the region between two branches. Removing
branches often gives the compiler a better chance to optimize the code.
Finally, branches inside loops reduce the effectiveness of loop unrolling.

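<p>
A common case of branch-free limiting is clamping an audio intermediate
to the 16-bit sample range. Written as conditional expressions like this,
a MIPS4 compiler can use conditional moves rather than branches (the
function name is just for illustration):

```c
/* Saturate a 32-bit intermediate result to the 16-bit sample range.
   The conditional expressions can compile to conditional moves on
   MIPS4 -- no branches in the hot path. */
static short clamp16(int x)
{
    x = x > 32767 ? 32767 : x;
    x = x < -32768 ? -32768 : x;
    return (short)x;
}
```

Disassemble the result to confirm your compiler actually emits conditional
moves; on older ISAs it will still branch.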
<hr>
6. <b> Work with the cache. </b> Though the processor is very fast, the memory
system is relatively slow. If you don't get good cache performance,
your application will suck wind. A good rule of thumb is the
order-of-magnitude rule. Think of each successive level of cache, going
from the processor out to main memory, as 10 times as slow as the previous
level. Suppose an instruction which hits the primary cache takes 1 cycle.
The same instruction may require 10 cycles if it misses the primary cache
and hits the secondary cache, and 100 cycles if it misses the secondary cache
and requires a cache-line fetch from main memory. The effect is more
pronounced as processors become faster. Cache effects can thus make a
huge difference in program performance.

<p>
Organize your code and your data to optimize your use of the cache.
Minimize your inner loops and working sets of data. For example, in
image processing, "tiling" is a common practice. Suppose a number of
consecutive operations are to be performed on an image. The "obvious" way
would be to perform each operation on the entire image, then proceed to
the next operation. However, assuming the image doesn't fit in the cache,
this gets the worst possible cache performance, since the image has to
be reloaded from main memory every time it is touched. A far superior
approach is to operate on small "tiles" which fit in the cache: perform
all the operations on one tile, then proceed to the next.

<p>
There are lots of other tricks to improve cache performance, but hopefully
this short example will get you thinking. If you need data in a critical
loop, try to make sure it's in the cache every time the loop is executed.

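<p>
Here's a toy sketch of the tiled loop structure, assuming the image
dimensions are multiples of the tile size (the sizes and the two
per-pixel "operations" are made up for illustration; real code would
pick TILE so a tile fits comfortably in the cache):

```c
#define W    64
#define H    64
#define TILE 16   /* illustrative; W and H are multiples of TILE here */

static unsigned char img[H][W];

/* Two consecutive per-pixel passes applied tile by tile, so each
   pixel is still cache-resident when the second pass touches it. */
static void process_tiled(void)
{
    int ty, tx, y, x;
    for (ty = 0; ty < H; ty += TILE)
        for (tx = 0; tx < W; tx += TILE)
            for (y = ty; y < ty + TILE; y++)
                for (x = tx; x < tx + TILE; x++) {
                    img[y][x] = (unsigned char)(img[y][x] + 10); /* pass 1 */
                    img[y][x] ^= 0xff;                           /* pass 2 */
                }
}

static int tile_demo(void)
{
    process_tiled();   /* img starts zeroed (static storage) */
    return img[5][7];  /* every pixel becomes (0+10)^0xff = 245 */
}
```

In real tiled code the passes are usually separate loops over one tile,
but the access pattern -- finish all work on a tile before moving on --
is the point.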
<hr>
7. <b> Watch out for floating-point promotion. </b> Double-precision arithmetic
is typically more costly than single-precision, so if you don't need
it, make sure you're not using it. In some cases the compiler will
automatically promote operands to double-precision "behind your back."
Here are some things to look for. Check out the "-float" option on the
C compiler, which instructs the compiler (in K&R mode) to avoid
promoting operands to double-precision. It's also a very good idea to
use function prototypes with functions taking "float" arguments --
otherwise the arguments will be promoted to double-precision (even with
the -float option). Finally, if a floating-point constant is to be
single-precision, it should be given in the form "0.0f", i.e.
explicitly single-precision, to avoid promotion to double-precision.

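<p>
Both points in one small sketch (the function itself is illustrative):

```c
/* With a prototype in scope, a "float" argument is passed as a float;
   under K&R rules (no prototype) it would be promoted to double at
   every call.  The "f" suffix keeps the constant single-precision. */
static float half(float x)   /* prototyped: x is not promoted */
{
    return x * 0.5f;         /* 0.5f, not 0.5: no double arithmetic */
}
```

Without the suffix, "x * 0.5" forces x up to double for the multiply
and the result back down to float -- two conversions plus a double
multiply where one single-precision multiply would do.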
<hr>
8. <b> Don't assume that assembly language is inherently better than
C code. </b> Write your algorithm in C unless you find by disassembly
that the compiler is missing some possible optimizations, or
unless there's some special trick you need that can really only
be expressed in assembly. In most cases you'll find that the compiler
does a really good job.

<hr>
9. <b> Don't assume that pointer arithmetic is inherently better than
array indexing. </b> The normal addressing mode of the MIPS processor (a fixed
offset from a register) lends itself nicely to array indexing. Consider
the following two loops:
<pre>
    for (i = 0; i < 50; i++) {
        array[i] = 0;
    }

    int *p = array;
    for (i = 0; i < 50; i++) {
        *p++ = 0;
    }
</pre>
<p>
Both are functionally equivalent if the value of p is not used later
in the second case. Some people shy away from the first approach in
the mistaken belief that the address generation for the array indexing
is computationally expensive. In fact, the compiler generates code for
the first version which has 20% fewer instructions per iteration than the
code for the second version. Why? The compiler often generates better code
for simpler expressions, because it can better understand what the code
is trying to accomplish. In the first case, the compiler sees that the code
is looping on an array index. It calculates the end address of the array
(200 from the base, since it is an integer array), and generates a loop which
has only one variable -- the address in the array. "i" as such is nowhere to
be seen. The loop requires four instructions per iteration:
<pre>
[test1.c: 5] 0x4009a8:  248500c8   addiu   a1,a0,200
[test1.c: 5] 0x4009ac:  24840004   addiu   a0,a0,4
[test1.c: 5] 0x4009b0:  0085082b   sltu    at,a0,a1
[test1.c: 5] 0x4009b4:  1420fffd   bne     at,zero,0x4009ac
[test1.c: 6] 0x4009b8:  ac80fffc   sw      zero,-4(a0)
</pre>
<p>
The second example is too "smart" for the compiler.
The compiler can't quite figure out what the programmer is trying
to accomplish, so it generates two loop variables, one for "i" and
one for "p" (even though "p" is never used after the loop). This
example requires 5 instructions per iteration, since it has to
increment two loop variables.
<pre>
[test2.c: 4] 0x4009a8:  24020032   li      v0,50
[test2.c: 6] 0x4009ac:  00002025   move    a0,zero
[test2.c: 6] 0x4009b0:  24840001   addiu   a0,a0,1
[test2.c: 6] 0x4009b4:  0082082a   slt     at,a0,v0
[test2.c: 7] 0x4009b8:  ac600000   sw      zero,0(v1)
[test2.c: 6] 0x4009bc:  1420fffc   bne     at,zero,0x4009b0
[test2.c: 7] 0x4009c0:  24630004   addiu   v1,v1,4
</pre>
<p>
(If you're not familiar with MIPS assembly, note that the instruction
after the branch [bne] is executed before the branch actually occurs.
This is the so-called "branch delay slot".)
<p>
Some folks have tested numerous ways of stating loops to see for
which the compiler generates the fastest code. This is an interesting
experiment, but the results may not hold true from compiler release to
compiler release. However, the general principle holds: don't be afraid
to state your loops in simple terms. Often the compiler does best with
the simplest constructs.
<hr>
10. <b> Watch out for expressions hidden in macros. </b> Know what's a macro
and what's a function. Expressions hidden in macros will sometimes
be evaluated multiple times; this causes a performance hit if it's
not required for correct evaluation of the macro.
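<p>
The classic example (SQUARE is a hypothetical macro, not from any
system header): the macro's argument appears twice in the expansion,
so whatever computation it contains runs twice.

```c
/* The argument appears twice in the expansion, so side effects --
   and expensive computations -- in the argument happen twice. */
#define SQUARE(x) ((x) * (x))

static int calls;
static int next(void) { return ++calls; }

static int square_demo(void)
{
    calls = 0;
    return SQUARE(next());   /* next() runs twice: result is 1 * 2 */
}
```

If the argument is an expensive function call, that's a silent doubling
of the work; if it has side effects, it's a correctness bug as well.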
<hr>
11. <b> Consider unrolling your loops. </b> Often the compiler will do this
for you, but you can sometimes glean a little more performance by doing
it yourself. Loop unrolling works as follows. Consider the following loop:
<pre>
    for (i = 0; i < x; i++) {
        array[i] = 0;
    }
</pre>
<p>
We've seen above (#9) that the compiler generates somewhat smart code
for this. For each iteration of the loop, the pointer is incremented
and tested against its end value. But we can cut down on the number of
instructions per array item if we write the loop like this:
<pre>
    for (i = 0; i < x; i += 4) {
        array[i] = 0;
        array[i+1] = 0;
        array[i+2] = 0;
        array[i+3] = 0;
    }
</pre>
<p>
This loop is faster than the first one. The trick is that the pointer
increment and test-against-end-value need only be done once for every 4
array elements, instead of for every array element. There's no extra
computation overhead associated with the (i+n) indices, because these
make use of the MIPS addressing mode: the compiler can encode the
fixed offset (represented by n) directly into each store, relative to
the address represented by i. Also, since there are fewer iterations,
there are fewer branches.
As noted above, a pipelined processor likes to execute instructions
linearly (#5). The compiler-generated code looks like:
<pre>
[test3.c: 6] 0x4009a8:  248500c8   addiu   a1,a0,200
[test3.c: 6] 0x4009ac:  24840010   addiu   a0,a0,16
[test3.c: 6] 0x4009b0:  0085082b   sltu    at,a0,a1
[test3.c: 7] 0x4009b4:  ac80fff0   sw      zero,-16(a0)
[test3.c: 8] 0x4009b8:  ac80fff4   sw      zero,-12(a0)
[test3.c: 9] 0x4009bc:  ac80fff8   sw      zero,-8(a0)
[test3.c: 6] 0x4009c0:  1420fffa   bne     at,zero,0x4009ac
[test3.c: 10] 0x4009c4: ac80fffc   sw      zero,-4(a0)
</pre>
<p>
Note that this snippet assumes that x is divisible by 4, unlike the
original loop. There's a way around this for general values of x:
split the loop into two loops. The first is not unrolled, and processes
(x & 3) elements. The remaining number of elements is guaranteed to be
divisible by 4, and the second, unrolled loop handles the bulk of the
elements. If x is going to be very small, it's not worth the overhead,
but for most values of x, this approach is a big win.
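<p>
The two-loop split looks like this (a sketch; the function name is
illustrative):

```c
/* Unrolled-by-4 clear that works for any x: a short scalar loop
   first handles the x % 4 leftover elements, then the unrolled loop
   takes the rest four at a time. */
static void zero_array(int *array, int x)
{
    int i = 0;
    for (; i < (x & 3); i++)      /* 0..3 leftover elements */
        array[i] = 0;
    for (; i < x; i += 4) {       /* remaining count divisible by 4 */
        array[i]   = 0;
        array[i+1] = 0;
        array[i+2] = 0;
        array[i+3] = 0;
    }
}

static int zero_demo(void)
{
    int a[7] = { 1, 2, 3, 4, 5, 6, 7 }, sum = 0, i;
    zero_array(a, 7);
    for (i = 0; i < 7; i++)
        sum += a[i];
    return sum;                   /* every element cleared */
}
```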
<p>
So why did we choose 4? First, it's a power of two, which makes the
modulo operation cheap (x & 3, as seen above). It's also fairly small.
As the loop-unrolling block size becomes larger, you get diminishing
returns. At some point, things actually slow down because the code
no longer gets nice cache behavior. Common values for the block size
are 4, 8, and sometimes 16.
<p>
Experiment with your code to see what works well. As always, if it's
a critical loop, disassemble it to see what the compiler is doing for
you.
<hr>
12. <b> Watch out for floating-point exceptions. </b> These can really hurt
your performance. Check out the <a href="fp.underflow"> excellent little blurb </a> that Chris
Pirazzi has written, available from the audio apps page.
<p>
[...more to come...]
</body>
</html>